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Abstract 



We consider a minimization scheme based on the Householder trans- 
port operator for the Grassman manifold, where a point on the manifold is 
represented by an m x n matrix with orthonormal columns. In particular, 
we consider the case where m ^ n and present a method with asymp- 
totic complexity mn^. To avoid explicit parametrization of the manifold 
we use Householder transforms to move on the manifold, and present a 
formulation for simultaneous Householder reflections for S-orthonormal 
columns. We compare a quasi-Newton and nonlinear conjugate gradi- 
ent implementation adapted to the manifold with a projected nonlinear 
conjugate gradient method, and demonstrate that the convergence rate is 
significantly improved if the manifold is taken into account when designing 
the optimization procedure. 



that is, we attempt to minimize the real valued fmiction / of X e ]]j"ix" where 
m ^ n, subject to the constraint X-^'X — I, and with the computable derivative 
d/ (X). The constraint on X ensures that / has a minimum, but this minimum 
is not necessarily unique. The method we present requires that m > 2n. Ex- 
tending the method to cover cases where just m > n is, however, possible. 

A special property we assume of / is the homogeneity condition: /(X) = 
/(XQ), where Q is any n x n orthonormal matrix. This property means that 
/ only depends on the span of the columns of X. The set of subspaces that are 
spanned by the columns of orthonormal m x n matrices is called the Grassman 
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1 Introduction 



We consider the optimization problem 
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manifold, M. While the solution to JT]) is an equivalence class, we choose an 
arbitrary representative of the class since we are interested in the value of /. 
The closely related Stiefel manifold consists of the same problem without the 
homogeneity condition. While the Householder transformation is suitable for 
both the Grassman and Stiefel manifolds, the optimization method presented 
does not optimize with respect to the basis and is therefore suitable only for the 
Grassman manifold. 

In the applications we have in mind the evaluation of / and df is expensive. 
For this reason we cannot employ a high quality line search to decide the step 
length, and the optimization method must be robust. Furthermore, we will 
measure the number of evaluations of / and df necessary to obtain a solution. 
One iteration of the optimization procedure requires the two evaluations, once to 
evaluate the solution candidate and once to construct a quadratic approximation 
along the search direction. 

For computational reasons we choose an mx n matrix, X, as a representative 
of a point on A4 , and enforce the requirement 

X^X = I. (2) 

While this representation includes more degrees of freedom than strictly neces- 
sary, the approach is suitable for use in practice [TU] . 
We use the inner product for matrices 

(A,B) = trace(A^B), (3) 

and the tangent spaces at X satisfying X-^X = I 

{Z = XA + Y|Y^X = OandA'^ = -A}. (4) 

On the Grassman manifold the value of / depends only on the space spanned by 
the columns of X. We can therefore ignore the XA component of the tangent 
space, and denote 

TxM ={Y|Y^X = 0}. (5) 

On the Grassman manifold it is possible to substitute equivalence classes for the 
representatives we have chosen, but practical computations require us to always 
use a specific matrix. 

We also assume that we are given a direction W by the minimization method 
in which we want to move on the manifold. We project the direction on to Tx-M. 

by 

Y = (I - XX^)W. (6) 

The motivation for the problem under consideration comes from density func- 
tional theory (DFT) electronic structure calculations [171 EI] ■ We construct a 
simultaneous Householder operator that can be used to ensure that the opti- 
mization method naturally enforces the orthogonality constraint. This approach 
differs from several other approaches in that it does not solve the canonical elec- 
tron orbitals [13l [T8j [22l |4j [23l [21] , instead we only solve the electron density 
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Figure 1: Conceptual difference between reorthogonalization and Householder 
approach. In Subfigure a) the step is taken without regard to the manifold, 
and after the step is taken the new solution candidate X^+i is constructed by 
reorthogonalizing + Y^. Subfigure b) illustrates the case where the House- 
holder operator, H, constructs an update X^+i that immediately satisfies the 
orthogonality condition. 



that would be given by the orbitals. To obtain the canonical orbitals from the 
electron density a linear eigenvalue problem must then be solved in the space 
spanned by the columns of X. A similar approach using polynomial filtering can 
be found in [231 13] ■ It is also possible to solve a nonlinear eigenvalue problem 
instead of the minimization problem [501 [131 UHl HI] ■ 

In [To] a framework for optimization methods on the Stiefel and Grassmann 
manifolds is presented, while [7] discusses a Newton-like iteration scheme on a 
more general manifold. Univariate optimization methods for the Stiefel manifold 
is presented in [5] , where identity plus rank one Householder transforms are given 
as one possible choice for moving on the manifold. The choice of coordinates 
can also be based on a QR factorization and polar decompositions [51 [5] or Lie 
groups [14] . An overview of geometric numerical integration techniques can be 
found in [15] . 

First, we present a simultaneous Householder transformation in Section [2] 
that we can use to move on both the Stiefel and Grassman manifolds. Then 
in Section 121 we recall the method of steepest descent, the quasi- Newton (QN), 
and the nonlinear conjugate gradient (NLCG) methods adapted for use with the 
Householder operator. In Section|4lwe numerically demonstrate the method on 
a model problem that includes nonlinearities similar to a DFT problem. Finally, 
Section (5] presents the conclusion. 

2 Householder operator 

Wc ensure that X G during the solution process by using the Householder 
transformation to move from one solution candidate to the next. 
To do this we need to find an operator 



H(T)=I-2Q(r)Q(T)^, 



(7) 
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where 

Q(TfQ(T)=I Vr, 
and T is a parametrization of H such that 

H(0)X = X. 

This requirement leads to 

Q(0)^X==O. 

H is unitary, and if we let 

Xfe+i = H(r)Xfc, 
we obtain a sequence {X^,} that satisfies 



Xi.Xfe Vfc,T. 



The orthonormality requirement ([5]) on Q(r) leads to the constraint 



OT OT 
9Q 



0. 



We set the initial condition for by requiring that 

d 



(H(r)X) 



T = 



(8) 

(9) 
(10) 

(11) 

(12) 

(13) 
(14) 



where Y g 7x-^ is a projected direction given by the minimization method we 
choose to employ. Differentiating with respect to r, and using property ([T0|) . 
we obtain from (fTH) 



_2Q(0)^(0)^X = Y. 



We set 



Q(o) = V, 

where VR is the compact QR decomposition of Y, and choose 



dr 



(0) 



^XR."^. 



A solution satisfying Equation (fT3| and conditions (|9]) and (fT4|) is 
Q(t) = Qoexp|^7 







IR- 




(15) 
(16) 

(17) 
(18) 



Here Qo ~ [V X] and Q(r) corresponds to the first n columns of Q(r). 
Given any Q with orthonormal columns H constructed by ([7]) ensures that 
X|'_|_]^Xfc+i = I. For this reason we also consider a Householder operator con- 
structed from a second order expansion of the matrix exponential that has 
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subsequently been orthonormalized by the QR method to ensure that the or- 
thogonahty constraint is satisfied. This approach is similar to the orbital trans- 
formation but includes orthogonalization after every evaluation of the matrix 
exponential function [2 3) . To distinguish these from the basic algorithms we use 
AEQN and AENLCG to denote the approximate exponential versions. 

Remark: If to < 2n the requirement Y-^X = restricts the number of 
columns in Y to below n. In this case the size of the first block in Equation (|18p 
should be reduced accordingly. A similar modification must be made if Y is not 
full column rank. For the Householder transformation to work we must still 
have m > n. 

3 Descent methods with orthogonaUty constraints 

In this section we consider the method of steepest descent, a quasi-Newton 
method, and a nonlinear conjugate gradient method for minimization with or- 
thogonality constraints. We also present the Householder operator for an S- 
orthonormal basis, and combine this with the optimization methods. 

3.1 The method of steepest descent 

The method of steepest descent for the Stiefel manifold is also known as the 
projected gradient method [5]. At each step we simply set 

Yk = -<7{I - XfcXi')V/(Xfc). (19) 

The parameter cr > is almost redundant for the method of steepest descent, 
but will become important for the quasi-Newton methods presented later. 

To decide the step length we evaluate /(H(r^)Xfc), where is an estimate 
step length, and construct the quadratic approximation p(r) of /(H(T)Xfc). We 
then solve Tmin — argmin p{t) from the system 

p{0) = /(H(O)X), (20) 
piT^,) = /(H(r|)X), (21) 
p'(0) = (V/(H(0)X),Y). (22) 

This permits us to compute H(/3T,„in) and evaluate /(H(/3T„iin)X) as well as 
V/(H(/3T„iin)X), where /3 is an underrelaxation parameter. To ensure that we 
obtain a non-increasing iteration we choose the step length from the set 
{0, /3Tinin, t|}, such that we obtain the lowest value of / evaluated so far. If 
Tk = we set T^^-^ = 0.25 x r| and otherwise we set T^_^_l — min(|T,„in|, 2 r|). 

Remark: It turns out that often is an acceptable choice for step length, 
and we believe that it is possible to construct a completely line search free 
minimization method with adaptive step length [2]. However, it must be tuned 
for a real-world problem, and we will not explore this option here. 
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Figure 2: The eurvature of the manifold must be taken into account when the 
trial solution is updated. The update operator, H, corresponds to a reflection 
in the illustrated plane. 
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3.2 Vector transport on the manifold 

To reduce the number of iterations we use information from previous evaluations 
to improve the search direction. To this end we construct a transport operator, 
T : Txfc-A^ Txfc+i-A^, that moves tangent vectors at X^to the tangent space 
at Xfe+i. 

Any given vector, Zfc g M™^", associated with a candidate solution, X^, 
and search direction, V, can be decomposed into 

Zfc = XfcA + VB + UC, (23) 

and we demand that X|^U = V^U = 0, and U^U = I in addition to X|^V = 
which is satisfied by construction. For simplicity we assume that Zfc is such that 
U e M™x" and A, B, C e M"^". A and B can be computed by projecting Zfc 
onto Xfc and V respectively, while UC is for example the QR-decomposition of 
the remainder. In practice the decomposition is not explicitly constructed. 

The different parts of the decompostion ([25)) must be transported separately 
when the position is updated to Xfc+i = HXfc. The component spanned by Xfc 
is refiected correctly by H, however a reflection gives the wrong sign to the V 
component of Zfc. The final component U is orthogonal to both Xfc and V, 
and should not change when H is applied. The transport operation is shown in 
Figure [2] We construct the transport operator by separating the components 
of Zfc by projection and subsequent application of H. The transport operator 
therefore becomes 

HXfcX^-HVV^ + (I-XfcX^-VV^). (24) 

In practice we apply this for vectors Zfc e Tx^A^ which satisfy Z^Xfc = 0, and 
we can use the simplified transport operator 

T(t) =-H(t)VV'^ + (I- VV"^), (25) 

and the tangent vector corresponding to Zfc at Xfc+i is 

Zfc+i=T(Tfc)Zfcerx,+i7W. (26) 

3.3 S-orthonormal Householder and transport operator 

In practice X is often represented by a discretization that requires a generaliza- 
tion of the orthonormality constraint The generalized constraint is 

X^SX = I, (27) 

where S is symmetric and positive definite. This constraint arises for example 
as an overlap matrix in DFT or a mass matrix in finite element calculations. If 
S is sparse or has other structure that can be exploited it can be preferable not 
to change basis for the optimization procedure. We therefore also present the 
Householder transform and optimization procedure for the S-orthonormal case. 
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We can construct both the Householder transform and the transport operator 
using the same argument as for the regular orthonormal long as we 

account for S orthonormality (l?7l) . However, if S is full, and lacks exploitable 
structure, the operation SX has asymptotic complexity rri^n and overshadows 
the rest of the procedure. 

The projection of the direction of steepest descent onto the S-orthogonal 
manifold is 

Y = -cr(I - XX^S)V/(X) (28) 

instead of (IT^ and satisfies Y-^SX — 0. We must also use the S weighted com- 
pact QR decomposition to compute a factorization Y ~ VR where V^SV — I, 
and R is upper triangular. 

With these modifications, the Householder operator in ^ becomes 

Hs(t) =I-2Q(r)Q(Tf S, (29) 

where Q(t) is as in Equation (IT5| . and remains unchanged. The S-orthonormal 
transport operator corresponding to (|25p is 

Ts(t) = -Hs(t)VV^S + (I - VV^S). (30) 

3.4 The quasi-Newton method based on Householder trans- 
forms 

The method of steepest descent generally performs poorly if the minimum of the 
target function is at the bottom of a narrow valley. Newton's method solves this 
problem, but requires that the Hessian of the function is available to determine 
the search direction. When the Hessian is not available we can replace it with 
an approximation of the true inverse Hessian of the system to obtain a quasi- 
Newton method. 

We base our method on Broyden's second or bad update to construct the 
approximate inverse Hessian, Gfc, of / at X^. While Broyden's second update 
does not construct a symmetric approximation, or ensure that the approxima- 
tion is positive definite it is a robust choice for electronic structure calculations 
[m [TJ [2]. However, we must take into account that our vectors are actually 
]gmxn matrices, which will lead to a method identical to the generalized Broy- 
den update. The secant condition is then 

Gfc+iAFfe = AXfc, (31) 

where we project the orbital differences 

AXfe = (I - Xfe+iXi^+iS)(Xfc+i - Xfc), (32) 

and gradient differences 

AFfc = (I - Xfc+iX^+iS)V/(Xfc+i) - T(rfc)(I - XfcX^)V/(Xfc) (33) 
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onto Txfc+i-A^- The no change condition is now 

GfeZ = Gfe+iZ VZ:Z^AFfc = 0. (34) 

These conditions corresponds to the generahzed Broyden's second update for 
groups of size n. We can therefore use the generahzed update formula jll) 

Gfe+i = Gfe + (AXfe - GfeAFfc)(AF^SAFfe)-iAr^S. (35) 

As initial guess we use Go = cri. With this choice, the quasi-Newton method 
is identical with the method of steepest descent if we do not enforce any secant 
conditions (|3T|) . When secant conditions are enforced we can use a to control the 
influence of Gq compared to the information gained from the secant conditions. 
In general, the information gained from these is reliable, and therefore a should 
be smaU [Hll]. 

To construct the search direction we use 

Yfc - -Gfc(I - XfcX^S)V/(Xfc). (36) 

In practice, we do not store G^ as a full matrix. Instead we represent it as 
a low rank update. Details on recursive or low rank implementation of G^ can 
be found in [H [Hill [2]. 

We also limit the number of secant conditions used to construct G^. Each 
condition requires storage of two m x n matrices, AX and AF, and these ma- 
trices must be transported to Tx^A^ after each step. This is done with the 
transport operator, Ts(t), defined in Equation pop. As we demonstrate in 
Section 2] the first few secant conditions offer dramatic improvement over the 
method of steepest descent, but further secant conditions do not give the same 
benefit. For this reason we limit the secant conditions by a pre-determined 
history length, and simply discard older conditions. 

We use the same line search as the one presented in Section [XT] If, however 

(Yfc, S(I - XfeX^S)V/(Xfc)) > (37) 

then the proposed direction is not a descent direction. In this case we restart 
the optimization method and forget the secant history. If the line search returns 
the current point Xfe, we update Gfe but stay at Xfc. 

3.5 Nonlinear conjugate gradients 

The linear conjugate gradient (CG) method can be viewed as a optimization 
method for a quadratic problem. Several generalizations of the CG method 
have been presented to solve optimization problems that are not of quadratic 
form [19]. Below, we review a nonlinear CG method adapted to account for the 
curvature of the manifold [TU] . 

Given Xq which satisfies X^Xq = I, the gradient projected onto 7xo-^ is 

Yo = (I-XoX^S)V/(Xo), (38) 
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and the initial search direction is the direction of steepest descent 

Po = -Yo. (39) 

On the manifold the NLCG method then proceeds by minimizing / along the 
path defined by the search direction P^. In practice we evaluate / once along 
the search direction and minimize the quadratic approximation as in Section r3.1l 
The next candidate is chosen as the best evaluated step, r^, 

Xfc+i = H(Tfc)Xfc, (40) 

and the gradient and conjugate directions are transported to Txfc+i-A^ by T(Tfc) 
in Equation pop . The new gradient 

Yfe+i = (I - Xfc+iXj+iS)V/(X,+i), (41) 

and conjugate direction 

Pfc+i = -Yfc+i + 7fcTs(rfc)Pfc, (42) 
are then computed where 

(Yfc+i - Ts(Tfc)Yfc, Yfc+i) 

= ^;:y;^ ■ ^"'^ 

For comparison we also implement a projected NLCG (PNLCG) method. 
Instead of ensuring that orbital updates satisfy X^_|^-^Xfc+i = I we orthogonalize 
Xfe+i after every update with the QR method. The conjugate directions and 
gradient are not transported to Tx^+i-^i instead they are updated with the 
I - Xfc+iX^_^^S projector onto TSc^+iA^- 



4 Numerical experiments 

We use a two dimensional model problem with the condition X^SX = I to 
compare the projected NLCG method with the NLCG and QN methods where 
satisfaction the orthonormality condition is ensured by the update operator. 
The model problem is inspired by electronic structure theory, and corresponds 
to a three dimensional system constrained to two dimensions without exchange- 
correlation terms. 

The target function is [TTI 

/(X) = -itr((Si/2x)^LSi/2X) + v^n + in^Pn, (44) 

where L G ]gnixm jg ^j^g discretized Laplace operator, v g M™ the external 
potential, n e M™ the electron density, and Pn the Hartree potential. The 
electron density is 

n 

n, = ^((SV2x)o(Si/2x)),„ (45) 
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where o is the entrywise, or Hadamard, product. We use the overlap matrix 

TM M •■ 



1 
9^2 



M M ^ 
M M 



(46) 



where h is the one dimensional grid size and 

[4 1 



M = 



1 4 
1 



(47) 



This corresponds to the mass matrix of a finite element discretization with 
bilinear quadratic element. The calculations have also been performed with a 
symmetric and positive definite random matrix. It turns out that as long as S 
is well conditioned, it has only a small effect on the rate of convergence. 
To calculate the potentials we use 



N 

E 



R.7 



(48) 



where the sum is over the nuclei with charge Zj and position Rj . The position 
corresponding to the discretization point i is r^, and the parameter a is used to 
regularize the potential. P € R™x™ is similarly given by 



1 



(49) 



We solve the problem in the unit square with zero boundary conditions 
corresponding to an infinite potential well. We use a uniform finite difference 
discretization with m inner points to obtain a system where X g R™^". Here 
n corresponds to the number of electrons. As initial guess we use the solution 
of the quadratic problem using the first two terms of (|44|) . and choose Tq = 1.0 



to initialize the minimization procedure, cf. Equation (|20|) . 

As generators for the external potential we use two nuclei, where one is 
placed at the grid point closest to (I, I), and the other at the grid point closest 
to ( § , ^ ) • The off diagonal placement is chosen to break the symmetry of the 



system. We use 



ex = ||(I-XX^)V/(X) 



(50) 



to measure convergence, and consider the system converged when 



ex < 10- 



(51) 
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Figure 3: Iterations required for convergence as a function of the spatial degrees 
of freedom, m. For subfigure a) the external potential is generated by two 
nuclei, Zi — Z2 = and the parameters of the calculation are n = 6, 
/3 — 0.5, a ~ 10^"*, a = 2 X 10^^, and history length is 6. For subfigure b) 
Zi = A, Z2 — 3, and n = 7. Here NLCG corresponds to the nonlinear conjugate 
gradient method, PNLCG to the projected NLCG, QN to the quasi-Newton 
method and AEQN to the approximate exponent QN method. The number 
of iterations required for convergence is identical for the AENLCG and NLCG 
methods, where AENLCG is the approximate exponent NLCG method. 



At this point 

|/(XRef)-/(X)|«10-^ (52) 
where the reference solution has been calculated such that 

exnof < 10"' • (53) 

Figure |3] presents the iterations necessary for convergence for a six and seven 
electron system. These iterations roughly grows as the square root of the degrees 
of freedom for the NLCG methods. However, the projected NLCG method 
performs significantly worse than the NLCG and QN methods adapted for the 
manifold. It turns out that the AENLCG method requires the same number of 
iterations to converge as the NLCG method, while a small difference is visible 
between QN and AEQN methods. 

In Figure |4] the effect of weight, a, of the initial approximation of the in- 
verse Hessian and the underrelaxation, (3 are presented. From Figure [4] b) it is 
clear that a is particularly important for fast convergence of the quasi-Newton 
method. The effect of /3 is much smaller, and while /3 can in some cases improve 
convergence the effect of a is significantly more important. A low a results in 
slower convergence as long as the more aggressive parameter choice converges 
well. However, when the rate of convergence begins to suffer from the more 
aggressive parameter choice the rate of convergence can be improved by a more 
conservative choice. These results agree with earlier work [1], which indicate 
that the secant conditions offer reliable information of the electronic structure 
problem, while the initial approximation of G is less reliable. The history length 



12 




Figure 4: Iterations required for convergence of the QN method as a function of 
the spatial degrees of freedom, m. For the both figures the external potential is 
generated by two nuclei, Zi =3, Z2 =3, and the parameters of the calculation 
are n = 6, a ^ 2 x 10^"^ , and history length is 6. Unless otherwise indicated in 
the figure f3 — 0.5 and a ~ 10"'*. 



a) b) 




02468 10 2468 10 

History length History leirgth 



Figure 5: Iterations required for convergence as a function of history length 
for the QN method for a six and seven electron system for subfigure a) and 
b) respectively. Spatial degrees of freedom is 2500 and history length varies, 
parameters are otherwise identical to Figure [3l 
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of the QN method must also be sufficient for the method to perform well. This 
is illustrated in Figure [5j 

5 Conclusion 

We have presented a Householder update scheme which ensures that the columns 
remains orthogonal, and is suitable for both NLCG and QN methods. Further- 
more, the operator allows us to transport gradient information and construct 
secant conditions from previous evaluations of / to the tangent space of the 
best candidate solution. This approach eliminates the need to parametrize the 
manifold, and permits us to use standard linear algebra routines to update the 
solution candidate. 

We have demonstrated the methods numerically on a model problem in- 
spired by the electronic structure problem, and compared them to a projected 
NLCG method. Taking the underlying manifold into account significantly im- 
proves convergence rate of the optimization methods, and using a second order 
orthonormal approximate matrix exponent does not decrease performance of 
the QN or NLCG methods. 

The QN method is significantly improved by taking the first few secant 
conditions into account when constructing the approximation of the Hessian of 
/. However, for the secant condition history to improve convergence speed of the 
QN method the manifold must be taken into account. The update of the secant 
conditions is also based on the Householder operator that is used to update the 
solution candidate. While the performance of the QN method depends on the 
weight of the initial approximate Hessian the QN method performs well once 
the weight is correctly set. 

While the QN method is sensitive to the correct choice of cr, the performance 
of the NLCG method does not depend on parameter choice. Furthermore, the 
NLCG method is relatively simple to implement, and only requires a one step 
history. For these reasons we believe that the NLCG method is a good general 
purpose optimization method for electronic structure problems if the method is 
adapted to the manifold. 
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